Fix/arc e2g vroom segfault by kaybrand · Pull Request #102 · EngreitzLab/scE2G

kaybrand · 2026-04-16T22:49:03Z

readGeneric() from the genomation package uses vroom_() for lazy loading. vroom uses a memory map that was implemented in C++. On a distributed system, threads may find themselves using an outdated memory map. This patch prevents a race condition that caused a segmentation fault when one thread tried to read from a memory address that another thread had just freed and the OS threw an error to prevent data corruption.

Four scripts previously using readGeneric are modified to use makeGRangesFromDataFrame(fread(path_to_data) instead. The data are 0-indexed, requiring us to tell makeGRangesFromDataFrame() that starts.in.df.are.0based = TRUE. GRanges prefers 1-indexed data, so it adds 1 to the start position as it reads in the data. Adjusted scripts must also subtract 1 from the start position before writing data to maintain the use of 0-indexed convention in scE2G.

1-indexed: [start, end]
0-indexed: [start, end)

… segfault genomation::readGeneric() calls vroom::vroom(altrep=lazy), which uses mmap() to lazily map files into memory. On Lustre (OAK), lock revocation or client cache eviction can invalidate mmap'd pages, causing SIGSEGV with SEGV_ACCERR in the vroom C++ layer — a hardware fault below R's managed memory that cannot be caught or recovered from. Replace both readGeneric() calls with fread() + makeGRangesFromDataFrame(): - bed.narrowPeak: fread(select=1:3) reads only chr/start/end, matching readGeneric's default keep.all.metadata=FALSE behavior. - bed.allPutative: fread(select=c("chr","start","end","TargetGene")) reads only the 4 columns actually used (findOverlaps coordinates + TargetGene assignment at line 33), skipping ~20 unused ABC columns. This recovers the practical benefit of vroom's lazy column parsing without mmap. Both reads now use starts.in.df.are.0based=TRUE to correctly convert 0-based BED [a, b) to 1-based inclusive GRanges [a+1, b]. The old code loaded BED start directly as a 1-based GRanges start, which elongated every region by 1bp and caused spurious findOverlaps hits at region boundaries. Output coordinate preservation: - PeakName construction uses start(gr) - 1L to emit the original 0-based BED start. - df.pairs.e2g$start is decremented by 1 before fwrite, restoring 0-based BED format in the output TSV. - On-disk output is format-identical to previous runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ap segfault Same vroom mmap segfault fix as make_kendall_pairs.R (see prior commit for full root cause description). Replace readGeneric() with fread(select=c("chr","start","end","PeakName")) + makeGRangesFromDataFrame(starts.in.df.are.0based=TRUE). Only the 4 columns actually used are read; TargetGene and PairName are skipped. Replace library(genomation) with library(data.table) for fread(). ATAC matrix rowname fix: Signac::FeatureMatrix derives rownames via GRangesToString, which emits 1-based GRanges starts. Under the old (buggy) code, the GRanges 1-based start numerically equalled the 0-based BED start, so rownames accidentally matched the 0-based PeakName strings in Pairs.Kendall.tsv.gz. With the corrected coordinate loading (starts.in.df.are.0based=TRUE), GRanges starts are now bed_start+1, so default Signac rownames would be chr-(a+1)-b while PeakName is chr-a-b. After FeatureMatrix, rownames are explicitly set to paste(seqnames, start-1L, end, sep="-") to restore 0-based BED naming. FeatureMatrix preserves input feature order, so this is a safe positional rename. compute_kendall.R matches mcols(bed.E2G)[,"PeakName"] %in% rownames(data.ATAC), so the two must agree on coordinate convention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…gfault Same vroom mmap segfault fix as prior commits (see make_kendall_pairs commit for full root cause description). Replace readGeneric(kendall_pairs_path, keep.all.metadata=T, header=T) with makeGRangesFromDataFrame(fread(kendall_pairs_path), keep.extra.columns=TRUE, starts.in.df.are.0based=TRUE). All columns from kendall_pairs are used in this script, so no select= narrowing. Remove library(genomation) from imports. Output coordinate preservation: df.pairs.E2G$start is decremented by 1 before fwrite, restoring 0-based BED format in Pairs.Kendall.tsv.gz. This file is consumed by compute_arc_e2g.R, which expects 0-based BED coordinates. PeakName is carried through from the input file's metadata column (not reconstructed), so it remains in its original 0-based format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…gfault This is the script that originally triggered the investigation: 1 of 4 scE2G runs segfaulted in vroom_() C++ code during the arc_e2g rule (SIGSEGV with SEGV_ACCERR, "invalid permissions"). Root cause: vroom's mmap-based lazy reading on Lustre, where lock revocation invalidates mmap'd pages below R's memory management layer. Replace readGeneric(abc_predictions_path) with makeGRangesFromDataFrame(fread(abc_predictions_path), keep.extra.columns=TRUE, starts.in.df.are.0based=TRUE). All ABC columns are used (AddMax, IntegrateABC), so no select= narrowing. Replace GRanges(fread(kendall_predictions_path)) with makeGRangesFromDataFrame(fread(kendall_predictions_path), keep.extra.columns=TRUE, starts.in.df.are.0based=TRUE). The old GRanges(fread()) call also misinterpreted 0-based BED starts. Remove library(genomation) from imports. Coordinate correction: The old code loaded BED [a, b) as GRanges [a, b] (treating the 0-based start as 1-based). This elongated every region by 1bp, causing findOverlaps in AddMax to produce spurious hits where two regions share a boundary but don't overlap in BED semantics. Quantified on igvf10/thp1_2: 1,502 of 10,303,415 ABC rows (0.015%) were affected; max ARC.E2G.Score ratio old/new = 2.05; max absolute score change = 0.0045. Output coordinate preservation: df.output$start is decremented by 1 before fwrite, restoring 0-based BED format in EnhancerPredictionsAllPutative_ARC.tsv.gz. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kaybrand · 2026-04-16T22:53:07Z

We will come back to this at a later date.

kaybrand and others added 4 commits April 16, 2026 10:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/arc e2g vroom segfault#102

Fix/arc e2g vroom segfault#102
kaybrand wants to merge 4 commits intoEngreitzLab:mainfrom
kaybrand:fix/arc-e2g-vroom-segfault

kaybrand commented Apr 16, 2026

Uh oh!

kaybrand commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kaybrand commented Apr 16, 2026

Uh oh!

kaybrand commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant